Executive Summary

A person's item response pattern may reveal undesirable behavior such as faking responses on
personality tests, guessing, or knowledge of the correct answers due to test preview. These undesirable
behaviors may result in inappropriate test scores and may have serious consequences for practical test use;
for example, they could result in decision errors in testing for job and educational selection.

In analyzing Law School Admission Test (LSAT) data operationally, item response theory (IRT) is
applied. IRT is a mathematical model whereby the probability that a test taker will answer an item correctly is
related to the ability level of the test taker and the characteristics of the item. To detect item response patterns
that do not fit regular test-taking behavior under an IRT model, several person-fit statistics have been proposed.
Nearly all statistics are based on the difference between the observed item scores and those expected under the
IRT model. When the distribution of a statistic under the hypothesis of regular response behavior is known,
item response patterns can be classified as fitting or not fitting the model. To date, almost all fit statistics have
been proposed for conventionally administered paper-and-pencil (P&P) tests. With the advent of computerized
adaptive testing (CAT), research is needed with respect to the application of person-fit statistics in CAT. In
earlier research, the empirical distributions under CAT of a frequently used fit statistic, $l_z$, and an adaptation,
$l_z^*$, were studied. That research found that for simulated P&P data the empirical distribution of $l_z^*$ was more in
agreement with the standard normal distribution than the distribution of $l_z$. For CAT data, however, there was
a large discrepancy between the empirical and standard normal distribution for both statistics. Consequently,
inadvertent application of these person-fit tests to CAT data may result in inaccurate decisions.

In this paper, several new fit statistics especially designed for CAT were studied. These statistics were
based on the theory of statistical process control (SPC). In industrial applications, a (production) process is in a
state of statistical control if the variable being measured has a stable distribution. One technique from SPC is
based on Shewhart control charts. These charts are used to determine if a process is in statistical control by
examining past data. An example of a Shewhart control chart is the $\bar{X}$-chart, where the observed averages of
the variable being measured in a sample of size n are calculated over time. $\bar{X}$-charts are very effective in
detecting large shifts in the quality of a production process. Another technique from statistical process control
is the cumulative sum procedure. This paper shows that this procedure could be successfully applied to detect
test takers in a CAT whose response behavior is not in accordance with the IRT model. For example, suppose
an examinee becomes more and more careless during the test because of fatigue. As a result, in the first part of
the CAT responses will tend to alternate between correct and incorrect, whereas in the second part of
the test more and more items are answered incorrectly due to carelessness. The statistics studied in this paper
were designed to be sensitive to such changes in response behavior. The power of these statistics with respect
to a large variety of nonfitting response behaviors was determined. Recommendations on the appropriate
choice from these statistics for CAT programs were made.

Abstract 

In this study a cumulative sum (CUSUM) procedure from the theory of statistical process control was
modified and applied in the context of person-fit analysis in a computerized adaptive testing (CAT) environment.
Six person-fit statistics were proposed using the CUSUM-procedure, three of which can be used to investigate the
CAT in online test taking. The usefulness of these statistics was explored in a small simulation study. In most
CUSUM-procedures standard normally distributed statistics are used; based on this normality, boundaries can be
determined to decide when a process is out-of-control. The statistics proposed in this study, however, are not
standard normally distributed and therefore, in this study, boundaries were determined using simulated data.
This study investigated whether the numerical values of the upper and lower threshold were dependent on the
latent trait, θ. Results showed that the upper and lower bound were rather stable across θ-levels. However, the
height of the boundaries was dependent on the particular statistic chosen.

Introduction 

The development of group ability tests has enabled the comparison of an individual's total test score with
the scores of a population norm group, thus allowing for more meaningful interpretation of the ability level.
However, a person's test score may not reveal the operation of other undesirable influences on test-taking
behavior such as faking on biodata questionnaires and personality tests, guessing, or knowledge of the correct
answers due to test preview. These and other influences may result in inappropriate test scores, which may
have serious consequences for practical test use, for example, in job and educational selection, where
classification errors may result.

To detect item response patterns that do not fit an item response theory (IRT) model, several person-fit
statistics have been proposed (Drasgow, Levine, & Williams, 1985; Molenaar & Hoijtink, 1990; Klauer & Rettig,
1990). In almost all statistics, for an individual person with a latent trait value θ the difference between the
observed and expected item score on the basis of an IRT model is compared across items. When the distribution
of a statistic is known under a null model, item score patterns can be classified as fitting or nonfitting.

To date, almost all fit statistics have been proposed in the context of conventionally administered tests or
paper-and-pencil (P&P) tests. With the increasing use of computerized adaptive testing (CAT), additional
research is needed with respect to the application of person-fit statistics in CAT. Recently, Meijer and van
Krimpen-Stoop (1998) investigated the empirical distribution of an often used fit statistic, $l_z$ (Drasgow et al.,
1985), and an adaptation, $l_z^*$ (Snijders, 1998), that corrects for the use of the estimated latent trait value, $\hat{\theta}$,
instead of θ in $l_z$. Both statistics were assumed to be standard normally distributed. Meijer and van
Krimpen-Stoop (1998) found that for simulated P&P data, when $\hat{\theta}$ was used instead of θ, the empirical
distribution of $l_z^*$ was more in agreement with the standard normal distribution than the distribution of $l_z$.
For CAT data, however, there was a large discrepancy between the empirical and theoretical distribution for
both statistics. Consequently, decisions about the fit of a score pattern on the basis of theoretical critical
values will be inaccurate. As an alternative, Meijer and van Krimpen-Stoop (1998) proposed to simulate the
distribution for a given $\hat{\theta}$ through parametric bootstrapping. Given a fixed θ value, P&P and CAT response
vectors were generated and the distribution of the significance probabilities was determined on the basis of
$\hat{\theta}$. For P&P tests the results were promising in the sense that the significance probabilities were in agreement
with the expected percentages. However, for CAT and $\hat{\theta}$ the probabilities were too low, which hampers the
use of these statistics in a CAT environment.

This study proposes new fit statistics especially designed for a CAT. These statistics are based on the 
theory of statistical process control as explained below. The use of these statistics will be explored by means 
of two small simulation studies. 



Statistical Process Control 

A (production) process is in a state of statistical control if the variable being measured has a stable
distribution. Determining whether a process is in statistical control is often called statistical
process control (SPC). One technique from SPC is the Shewhart control chart, originally proposed by
Shewhart (1931); these charts are used to determine if a process is in statistical control by examining past data.
An example of a Shewhart control chart is the $\bar{X}$-chart, where the observed averages ($\bar{X}$) of the variable being
measured in a sample of size n are recorded over time. An $\bar{X}$-chart is very effective in detecting large mean
shifts in the production process. However, a disadvantage of Shewhart control charts is that the chart only uses
the information about the process contained in the last sample taken; the information of the entire sequence of
samples is ignored. As a result, Shewhart charts are rather ineffective in detecting small shifts in the process. A
technique from SPC that is more effective in detecting smaller shifts in the mean is the cumulative sum
(CUSUM) procedure, originally proposed by Page (1954). In a CUSUM-procedure, sums are accumulated, but
a value of the statistic, obtained from a sample of size n, is only accumulated if it exceeds "the goal value" by
more than k units. Suppose that $Z_i$ is the value of statistic Z obtained from a sample at time i, k is the reference
value, and h is some threshold. Then, the two-sided CUSUM procedure can be written in terms of $S_i^+$ and $S_i^-$,
where



$$
\begin{aligned}
S_1^+ &= \max[0,\ Z_1 - k] \\
S_2^+ &= \max[0,\ (Z_1 - k) + (Z_2 - k)] \\
      &= \max[0,\ (Z_2 - k) + S_1^+] \\
S_3^+ &= \max[0,\ (Z_3 - k) + S_2^+] \\
      &\ \ \vdots \\
S_i^+ &= \max[0,\ (Z_i - k) + S_{i-1}^+],
\end{aligned}
\tag{1}
$$

and analogously

$$S_i^- = \min[0,\ (Z_i + k) + S_{i-1}^-], \tag{2}$$

with starting values $S_0^+ = S_0^- = 0$. Note that the cumulations can be running on both sides concurrently.

The sum of consecutive positive values of $Z_i - k$ is reflected by $S^+$ and the sum of consecutive negative values
of $Z_i + k$ by $S^-$. Thus, as soon as $|Z_i| > k$ the CUSUM-chart starts. The process is in an "out-of-control" state
when $S^+ > h$ or $S^- < -h$ and "in-control" otherwise. This means that, after a number of consecutive positive or
negative values of the statistic, the process can become out-of-control. One assumption underlying the
CUSUM procedure is that the $Z_i$-values computed are approximately standard normally distributed; the
values of k and h are based on this assumption. The value of k is usually selected as one half of the mean
shift (in $Z_i$-units) one wishes to detect; for example, k = 0.5 is the appropriate choice for detecting a shift of
one times the standard deviation of $Z_i$.

In practice, CUSUM charts with k = 0.5 and h = 4 or h = 5 are often used (see, for example, Montgomery,
1991, p. 295).
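
The recursion in Equations 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration rather than code from the study: the function name `two_sided_cusum` and the choice to report the first sample at which a sum crosses the threshold are assumptions made here, and the defaults k = 0.5 and h = 5 are simply the textbook values mentioned above.

```python
def two_sided_cusum(z_values, k=0.5, h=5.0):
    """Two-sided CUSUM of Equations 1 and 2.

    Returns the paths of S+ and S- and the 1-based index of the first
    sample at which S+ > h or S- < -h (None if the process never signals).
    """
    s_plus_path, s_minus_path = [], []
    s_plus = s_minus = 0.0          # starting values S_0^+ = S_0^- = 0
    first_signal = None
    for i, z in enumerate(z_values, start=1):
        s_plus = max(0.0, (z - k) + s_plus)     # Equation 1
        s_minus = min(0.0, (z + k) + s_minus)   # Equation 2
        s_plus_path.append(s_plus)
        s_minus_path.append(s_minus)
        if first_signal is None and (s_plus > h or s_minus < -h):
            first_signal = i
    return s_plus_path, s_minus_path, first_signal


# A sustained upward shift: every Z_i exceeds k by one unit.
up_plus, up_minus, up_signal = two_sided_cusum([1.5] * 6)
```

With a constant value of 1.5 the upper sum grows by one unit per sample, so the threshold h = 5 is first exceeded at the sixth sample, while the lower sum stays at zero.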

Computerized Adaptive Testing, Person Fit, and Statistical Process Control 



IRT models describe the probability of a correct response to an item as a function of the item and person
parameters. In this study we use the two-parameter logistic IRT model (2-PLM) because it is less restrictive
with respect to empirical data than the one-parameter logistic model and it does not have the estimation
problems of the guessing parameter in the three-parameter logistic model (e.g., Baker, 1992, pp. 109-112). The
2-PLM has been shown to have a reasonable fit to achievement and personality data (e.g., Reise & Waller,
1990; Zickar & Drasgow, 1996).

Let $X_{vj}$ be the binary (0, 1) response of test taker v to item j (j = 1, ..., J), where 1 denotes a correct or keyed
response, and 0 denotes an incorrect or not keyed response. Further, let $a_j$ denote the item discrimination
parameter, $b_j$ the item difficulty parameter, and $\theta_v$ the latent trait value of test taker v. Then the probability
of correctly answering an item according to the 2-PLM can be written as



$$P_{vj} = P_j(\theta_v) = \frac{\exp[a_j(\theta_v - b_j)]}{1 + \exp[a_j(\theta_v - b_j)]}. \tag{3}$$
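
As a small illustration of Equation 3, the 2-PLM response probability can be computed as below. The function name `p_2pl` is hypothetical; the expression uses the algebraically equivalent form 1 / (1 + exp(-a(θ - b))).

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct (keyed) response under the 2-PLM (Equation 3).

    theta: latent trait value, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# When the item difficulty equals the latent trait, the probability is 0.5,
# which is the situation an adaptive test tends toward.
p_matched = p_2pl(theta=1.0, a=1.2, b=1.0)
# An easy item for an able test taker gives a probability well above 0.5.
p_easy = p_2pl(theta=2.0, a=1.0, b=0.0)
```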



For a P&P test, the administered test consists of the same items for all test takers; in a CAT, items are
selected based on the responses to previous items, and an item is selected from a large pool of items to adapt
to the ability of the test taker. As a result, test takers with high θ-values respond to more difficult items than
test takers with low θ-values and, especially at the end of the test, the probability of a correct answer to the
selected item is close to 0.5 for all test takers. An important implication of this selection process is that the
response patterns of normally responding test takers consist of an alternation of 1 and 0 scores.

To investigate a test taker's fit to an IRT model, almost all person-fit statistics compare the difference between
the observed and expected item score across items. A general form in which most person-fit
statistics in this context can be expressed is






(4) 



where the statistic is of the centered form, that is, the expected value of the statistic equals 0, and $w_j(\theta)$
denotes a suitable weight; often the variance of an item score is taken into account to obtain a standardized
version of the statistic. A person-fit statistic is said to be standardized when the distribution is the same
across θ values.

For example, Wright and Stone (1979) proposed a person-fit statistic based on standardized residuals,
where the weight

$$w_j(\theta) = \frac{1}{J\,P_j(\theta)\,[1 - P_j(\theta)]}$$

was taken, resulting in

$$U = \sum_{j=1}^{J} \frac{[X_j - P_j(\theta)]^2}{J\,P_j(\theta)\,[1 - P_j(\theta)]}. \tag{5}$$

U can be interpreted as the mean of the squared standardized residuals based on J items.
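
To make the interpretation of U concrete, the sketch below computes the mean of the squared standardized residuals for a short response vector. It is illustrative only: the function name `u_statistic` and the example probabilities are assumptions, not values from the study.

```python
def u_statistic(responses, probs):
    """Mean of squared standardized residuals over the administered items.

    responses: list of 0/1 item scores; probs: model probabilities of a
    correct response for the same items (each strictly between 0 and 1).
    """
    total = 0.0
    for x, p in zip(responses, probs):
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)


# Hypothetical probabilities for three items.
probs = [0.8, 0.5, 0.2]
u_expected_pattern = u_statistic([1, 1, 0], probs)    # likely responses
u_unexpected_pattern = u_statistic([0, 1, 1], probs)  # unlikely responses
```

Because the residuals are squared, an unexpected error on an easy item and an unexpected success on a hard item contribute in the same way, so the sign of the residual is lost.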

One of the drawbacks of U is that negative and positive residuals cannot be distinguished. In a CAT this
distinction is of interest because a string of negative or positive residuals may indicate aberrant behavior. For example,
suppose a test taker with an average θ-value responds to a test and during the test becomes
more and more careless because of fatigue. As a result, in the first part of the test the responses
will be an alternation of zeros and ones, whereas in the second part of the test more and more items are
answered incorrectly due to carelessness; thus, in the second part of the test, consecutive negative residuals
will occur.

The sum of consecutive negative or positive residuals can be investigated by slightly changing the
CUSUM-procedure described above. A CAT can be viewed as a multistage test, where each item is a stage.
Let i denote the stage of the CAT, running from 1 to the final test length for test taker v. Further, let the
statistic $T_i$ be a function of the residuals at stage i, and let j denote the item selected from an item pool
containing n items; that is, j ∈ {1, 2, ..., n}. Then, for test taker v per stage i of a CAT



$$S_i^+ = \max[0,\ (T_i - k) + S_{i-1}^+], \tag{6}$$

$$S_i^- = \min[0,\ (T_i + k) + S_{i-1}^-], \text{ and} \tag{7}$$

$$S_0^+ = S_0^- = 0, \tag{8}$$

where $S^+$ and $S^-$ reflect the sum of consecutive positive and negative residuals, respectively. Let UB and LB be
some appropriate upper and lower bound, respectively; then, when $S^+ > UB$ or $S^- < LB$, the response pattern
can be classified as not fitting the model.
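
The classification rule built on Equations 6 through 8 can be sketched as follows, using the careless test taker described earlier as an example. The function name `cusum_person_fit`, the residual values, and the settings k = 0, UB = 2, and LB = -2 are purely hypothetical and chosen for illustration.

```python
def cusum_person_fit(t_values, k, ub, lb):
    """Apply Equations 6-8 to per-stage statistics T_i.

    Returns the list of (S+, S-) pairs per stage and the first stage
    (1-based) at which S+ > ub or S- < lb, or None if never exceeded.
    """
    s_plus = s_minus = 0.0                        # Equation 8
    path = []
    first_flag = None
    for stage, t in enumerate(t_values, start=1):
        s_plus = max(0.0, (t - k) + s_plus)       # Equation 6
        s_minus = min(0.0, (t + k) + s_minus)     # Equation 7
        path.append((s_plus, s_minus))
        if first_flag is None and (s_plus > ub or s_minus < lb):
            first_flag = stage
    return path, first_flag


# Hypothetical residuals: a fitting pattern alternates around zero ...
normal = [0.5, -0.5] * 10
# ... whereas a careless test taker produces a run of negative residuals.
careless = [0.5, -0.5] * 5 + [-0.5] * 10

_, normal_flag = cusum_person_fit(normal, k=0.0, ub=2.0, lb=-2.0)
_, careless_flag = cusum_person_fit(careless, k=0.0, ub=2.0, lb=-2.0)
```

Under these assumed settings the alternating pattern never leaves the band, while the run of negative residuals drives the lower sum below LB at stage 14.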

New Person-Fit Statistics 



In this section several person-fit statistics are proposed that are especially developed for a CAT and are
based on the CUSUM-procedure discussed above.

In a CAT, $\theta_v$ is estimated at each stage i based on the responses given up to that stage. Let $\hat{\theta}_{vi}$ denote
the estimated $\theta_v$ at stage i; then, on the basis of $\hat{\theta}_{vi}$, the item for the next stage, i + 1, is selected. Let $\hat{\theta}_v$
denote the final estimate of $\theta_v$ (the estimate at the last stage), $X_{vi}$ the binary (0, 1) response of test taker v
to item j at stage i, and $A_{vi}$ the set of items administered to test taker v up to and including stage i. Then, the
probability of answering item j correctly, evaluated at $\hat{\theta}_{vi}$, can be written as

$$P_{i,j}(\hat{\theta}_{vi}) = P(X_{vi} = 1 \mid \hat{\theta}_{vi}) = \frac{\exp[a_j(\hat{\theta}_{vi} - b_j)]}{1 + \exp[a_j(\hat{\theta}_{vi} - b_j)]}. \tag{9}$$

For notational convenience the index v will be dropped below.

To take the sequential nature of a CAT into account, the probability of a correct answer to the administered
item j at stage i can be evaluated at $\hat{\theta}_{i-1}$; that is, $P_{i,j}(\hat{\theta}_{i-1})$, where $\hat{\theta}$ from the previous stage is used. Define

$$T_i^1 = X_i - P_{i,j}(\hat{\theta}_{i-1}), \tag{10}$$

(11)

(12)

where

$$I = \sum_{k \in A_i} a_k^2\, P_k(\hat{\theta}_i)\,[1 - P_k(\hat{\theta}_i)]. \tag{13}$$

That is, $I$ is the test information function according to the 2-PLM of a test containing the items
administered up to and including stage i, evaluated at the estimated θ at stage i. Thus, $T_i^1$ is the difference
between the response of test taker v to item j at stage i and the probability of a correct response to item j at stage i,
evaluated at the estimated ability at the previous stage, and $T_i^2$ and $T_i^3$ are functions of the residuals. The
statistics $T^1$, $T^2$, and $T^3$ are proposed to investigate the sum of consecutive positive or negative residuals, if
desired in an online situation.

Define

$$T_i^4 = X_i - P_{i,j}(\hat{\theta}), \tag{14}$$

(15)

(16)

where

$$I = \sum_{k \in A_i} a_k^2\, P_k(\hat{\theta})\,[1 - P_k(\hat{\theta})]. \tag{17}$$

$I$ is the test information function according to the 2-PLM of a test containing the items administered up
to and including stage i, evaluated at the final estimated θ. The statistics $T^4$, $T^5$, and $T^6$ are proposed to
investigate the sum of consecutive negative or positive residuals, evaluated at the final estimate of $\theta_v$. As a
result of using $\hat{\theta}$ instead of $\hat{\theta}_{i-1}$, the development of the accumulated residuals can no longer be investigated
in an online situation.

To determine upper and lower bounds in a CUSUM-procedure it is assumed that the statistic computed
at each stage is approximately standard normally distributed. However, the distributions of $T^1$ through $T^6$
are far from standard normal; $T^1$ and $T^4$ follow a binomial distribution with only one observation and the
other statistics are standardized versions of $T^1$ and $T^4$, and also based on only one observation. Therefore,
setting k = 0.5 and the upper and lower bound to 5 and -5, respectively, might not be appropriate. As
stated in Ryan (1989, p. 140), the value of h could be adjusted to compensate for this (i.e., lack of
normality) if there were guidelines for adjusting h for various nonnormal situations. Due to the absence of
guidelines for correction of h for nonnormally distributed statistics, a simulation study was performed to
determine upper and lower bounds for the CUSUM using $T^1$ through $T^6$.



Two Simulation Studies 



Two simulation studies were conducted to get a first impression of the usefulness of the
CUSUM-procedure in the context of person-fit analysis. First, in Study 1, the CUSUM procedure was applied
to different types of nonfitting score patterns; second, in Study 2, it was investigated whether the numerical
values of the upper and lower threshold of the CUSUM-procedure depend on the θ-level.

Study 1 



Method 



In this study the CUSUM-procedure described in Equations 6 through 8 was performed using three
different types of simulated adaptive response vectors: one normal response vector and two nonfitting
response vectors. To get an idea of the usefulness of the CUSUM-procedure the statistics $T^1$ and $T^4$ were
used; these statistics are illustrative for the other statistics, which are functions of $T^1$ and $T^4$. Two types of
aberrant response behavior were considered. First, guessing on all items in the test was simulated. This type
of aberrant behavior is realistic when a test taker only takes a CAT to become familiar with the item content
and randomly guesses the correct answer with a probability of 0.2 (assuming a test with five alternatives per
item). Second, a response vector was simulated using two different θ-values to respond to the test. Examples
of this aberrant behavior are sleeping, carelessness, and preknowledge about some items in the test. A
response vector was simulated assuming that the simulated test taker responds to the first half of the test
with $\theta_1$, and to the second half of the test with $\theta_2$, where the correlation between the two abilities is
characterized by a parameter ρ. Here the value ρ = 0 was taken. A pool of 400 items fitting the 2-PLM
with $a_j \sim N(1; 0.2)$ and $b_j \sim U(-3; 3)$ was used to simulate the three adaptive response vectors.

The normal response vector was simulated as follows. First, the true θ of a simulated test taker was set to
a fixed θ-level, here θ = 2. Then, the first item of the CAT selected was the item with maximum information
given θ = 0. For this item, P(θ) according to Equation 3 was determined. To simulate the answer (1 or 0), a
random number y from the uniform distribution on the interval [0, 1] was drawn; when y < P(θ) the response
to item j was set to 1 (correct response), 0 otherwise. The first four items of the CAT were selected with
maximum information for θ = 0, and based on the responses to these four items, $\hat{\theta}$ was obtained using
weighted maximum likelihood estimation (Warm, 1989). The next item selected was the item with maximum
information given $\hat{\theta}$ at that stage. For this item, P(θ) was computed, a response was simulated, $\hat{\theta}$ was
estimated and another item was selected based on maximum information given $\hat{\theta}$ at that stage. This
procedure was repeated until the test attained the length of 20 items.
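
The simulation steps for a normal response vector can be sketched as below. This is a simplified stand-in, not the study's implementation: the trait estimate is a grid-based posterior mean (EAP) with a standard normal prior rather than Warm's weighted maximum likelihood, the value 0.2 in N(1; 0.2) is treated as a standard deviation, and all function names and the random seed are assumptions.

```python
import math
import random


def p_2pl(theta, a, b):
    """2-PLM probability of a correct response (Equation 3)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_info(theta, a, b):
    """Standard 2-PLM item information."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


GRID = [g / 10.0 for g in range(-40, 41)]  # theta grid from -4 to 4


def eap_estimate(responses, items):
    """Posterior mean of theta on GRID (swapped in for weighted ML)."""
    weights = []
    for g in GRID:
        w = math.exp(-0.5 * g * g)  # standard normal prior, unnormalized
        for x, (a, b) in zip(responses, items):
            p = p_2pl(g, a, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    return sum(g * w for g, w in zip(GRID, weights)) / sum(weights)


def simulate_cat(true_theta, pool, rng, test_length=20, n_start=4):
    """Return administered item indices, responses, and estimates per stage."""
    available = list(range(len(pool)))
    administered, responses, estimates = [], [], []
    theta_hat = 0.0  # starting value; the first n_start items are chosen at 0
    for stage in range(1, test_length + 1):
        # Select the most informative remaining item at the current estimate.
        best = max(available, key=lambda idx: item_info(theta_hat, *pool[idx]))
        available.remove(best)
        a, b = pool[best]
        # Response is 1 when a uniform draw falls below P(true theta).
        x = 1 if rng.random() < p_2pl(true_theta, a, b) else 0
        administered.append(best)
        responses.append(x)
        if stage >= n_start:
            theta_hat = eap_estimate(responses, [pool[i] for i in administered])
        estimates.append(theta_hat)
    return administered, responses, estimates


rng = random.Random(1)
pool = [(rng.gauss(1.0, 0.2), rng.uniform(-3.0, 3.0)) for _ in range(400)]
items, x, theta_path = simulate_cat(2.0, pool, rng)
```

The list `theta_path` holds the estimate available after each stage, so the value at position i - 1 is the estimate that a previous-stage residual at stage i + 1 would use.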

The response vector generated by a simulated test taker guessing on all items was simulated as follows.
The first item of the CAT selected was the item with maximum information given θ = 0. To simulate the
answer (1 or 0), a random number y from the uniform distribution on the interval [0, 1] was drawn; when y
< P(θ) = 0.2 the response to item j was set to 1 (correct response), 0 otherwise. The first four items of the
CAT were selected with maximum information for θ = 0, and based on the responses to these four items, $\hat{\theta}$
was obtained using weighted maximum likelihood estimation (Warm, 1989). The next item selected was the
item with maximum information given $\hat{\theta}$ at that stage. For this item, a response was simulated with P(θ) =
0.2, $\hat{\theta}$ was estimated and another item was selected based on maximum information given $\hat{\theta}$ at that stage.
This procedure was repeated until the test attained the length of 20 items.

The simulated test taker with a two-dimensional ability parameter, with ρ = 0, was simulated by first
drawing two values $\theta_1$ and $\theta_2$ from the standard normal distribution, with correlation ρ = 0. Then, the first
item of the CAT selected was the item with maximum information given θ = 0. For the responses to the first
10 items $P(\theta_1)$ was used, and for the last 10 items $P(\theta_2)$ was used.

Results 

In Figures 1 and 2 the CUSUM-procedure was plotted for the normal response vector with θ = 2 using
statistics $T^1$ and $T^4$, respectively. The difference between $T^1$ and $T^4$ was that in $T^1$, $\hat{\theta}_{i-1}$ was used, that is, the
estimated θ obtained from the previous stage in the CAT, and in $T^4$, $\hat{\theta}$ was used, that is, the final estimate.
Figure 1 showed an increase in $S^+$ during the first seven items. This can be explained by the fact that this
simulated test taker had a high θ-value; in the beginning of the CAT, the simulated test taker responded to
relatively easy items given his/her θ-value and as a result a sequence of positive residuals, that is, correct
responses, appeared. Later on in the CAT, the items became more and more difficult; the alternation of zeros
and ones was reflected by the cumulative sums increasing and decreasing at the same time. Figure 2 reflected the same
process, but now the increase in the first seven items was much smaller. This was because a simulated test
taker with a high θ-value will, in general, also have a high final estimate ($\hat{\theta}$); due to the use of this final
estimate, the probability of a correct response to the first seven (easy) items was close to 1, and therefore, the
residual became close to zero. In contrast, using $\hat{\theta}_{i-1}$ (Figure 1) resulted in a probability of a correct response
close to 0.5, because here the starting point of estimating θ was set to 0.

FIGURE 1. CUSUM of a normal response vector, θ = 2, using statistic $T^1$

FIGURE 2. CUSUM of a normal response vector, θ = 2, using statistic $T^4$

In Figures 3 and 4 the CUSUM-procedure was plotted for the guessing response vector using statistics
$T^1$ and $T^4$, respectively. In Figure 3 the sequence of consecutive incorrect answers was reflected by decreasing
$S^-$. In Figure 4 the decrease of $S^-$ was smaller, again due to the use of $\hat{\theta}$. A guessing simulated test taker will
obtain a relatively low $\hat{\theta}$, and almost all responses will be incorrect. As a result $P(\hat{\theta})$ will be close to 0 for the
first items, and the residuals of the incorrect answers will be negative, but close to 0. This also
explained the large increase in Figure 4 for $S^+$: the probability of a correct response was close to 0 for the first
items and correct responses then resulted in residuals close to 1.




FIGURE 3. CUSUM of a "guessing" response vector using statistic $T^1$

FIGURE 4. CUSUM of a "guessing" response vector using statistic $T^4$

In Figures 5 and 6 the CUSUM-procedure was plotted for the response vector with a two-dimensional θ,
using statistics $T^1$ and $T^4$, respectively. Figure 5 showed first an increase of $S^+$ followed by a decrease of $S^-$.
This phenomenon suggested a higher θ-value during the first half of the CAT than during the second half of
the test. In Figure 6 the increase during the first half of the test was smaller, but still present, again due to the
use of $\hat{\theta}$.




FIGURE 5. CUSUM of a response vector with ρ = 0 using statistic $T^1$

FIGURE 6. CUSUM of a response vector with ρ = 0 using statistic $T^4$

Study 2



Method 

In this simulation study it was investigated whether the numerical values of the upper and lower
threshold of the CUSUM-procedure depend on the θ-level. When these boundaries are independent of θ, the
results of the CUSUM-procedure are comparable across simulated test takers.

Ten datasets consisting of normal adaptive response vectors were constructed. Five datasets of 200
response vectors were constructed at five different θ-levels, and five datasets of 1,000 response vectors were
constructed at five different θ-levels, where θ = -2, -1, 0, 1, and 2. Two different sizes of the datasets were
used to investigate the influence of sample size on the determination of the boundaries, UB and LB, of the
CUSUM-procedures. Different θ-levels were used to investigate the numerical values of UB and LB across
θ-levels.

A pool of 400 items fitting the 2-PLM with $a_j \sim N(1, 0.2)$ and $b_j \sim U(-3, 3)$ was used to simulate the
adaptive response vectors. An adaptive response pattern was simulated as described in Study 1, where the
true θ of a simulated test taker was set to a fixed θ-level, dependent on the dataset constructed. For each
simulated test taker, six different statistics, $T^1$ through $T^6$, were used in the CUSUM-procedure described in
Equations 6, 7, and 8. Then, for each simulated test taker and for each statistic,



$$\max S^+ = \max_i\,(S_i^+) \quad \text{and} \tag{18}$$

$$\min S^- = \min_i\,(S_i^-) \tag{19}$$



were determined, resulting in 200 or 1,000 values of max $S^+$ and min $S^-$ for each dataset and for each statistic. Then,
for each dataset, the upper threshold, UB, was determined as the value of max $S^+$ for which 5% of the
simulated test takers had higher max $S^+$-values, and the lower threshold, LB, was determined as the value of
min $S^-$ for which 5% of the simulated test takers had lower min $S^-$-values. That is, a two-sided test at α =
0.10 was conducted, where P(max $S^+$ > UB) = P(min $S^-$ < LB) = 0.05. In other words, for each statistic two
boundaries (upper and lower bound) per dataset were determined, where 10% of the simulated test takers
attained values outside these boundaries.
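
The threshold determination described above can be sketched as follows. The helper names `cusum_extremes` and `empirical_bounds` are hypothetical, and taking order statistics of the sorted values is one simple way, assumed here, of locating the value with 5% of the simulated test takers beyond it.

```python
def cusum_extremes(t_values, k):
    """Return (max S+, min S-) over all stages, as in Equations 18 and 19."""
    s_plus = s_minus = 0.0
    max_plus = min_minus = 0.0
    for t in t_values:
        s_plus = max(0.0, (t - k) + s_plus)
        s_minus = min(0.0, (t + k) + s_minus)
        max_plus = max(max_plus, s_plus)
        min_minus = min(min_minus, s_minus)
    return max_plus, min_minus


def empirical_bounds(max_s_plus, min_s_minus, tail=0.05):
    """UB and LB such that a fraction `tail` of the values lies beyond each."""
    n = len(max_s_plus)
    n_tail = int(round(tail * n))
    ub = sorted(max_s_plus)[n - n_tail - 1]   # n_tail values are higher
    lb = sorted(min_s_minus)[n_tail]          # n_tail values are lower
    return ub, lb


# Illustrative input only: 200 made-up extremes instead of simulated CATs.
fake_max = [float(v) for v in range(1, 201)]
fake_min = [-float(v) for v in range(1, 201)]
ub, lb = empirical_bounds(fake_max, fake_min)
n_above = sum(1 for v in fake_max if v > ub)
n_below = sum(1 for v in fake_min if v < lb)
```

With 200 values per dataset, ten values lie above UB and ten below LB, which corresponds to the two-sided test at α = 0.10.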

Results 

In Figures 7 and 8 the numerical values of the upper and lower bound, UB and LB, of the
CUSUM-procedure using statistics $T^1$ and $T^4$, respectively, were plotted against θ. Thus, the influence of the
use of $\hat{\theta}_{i-1}$ in $T^1$ compared with $\hat{\theta}$ in $T^4$ was reflected in Figure 7 and Figure 8. Figure 7 showed that UB of
the CUSUM using $T^1$ was approximately the same across θ-levels and across sample size. The LB of the
CUSUM using $T^1$ was less stable across θ-levels and across sample sizes; the differences across sample sizes
were largest for θ = 1 and θ = 2. For $T^4$ (Figure 8) the effect of the sample size on both
UB and LB was negligible; however, for LB, there was a slight difference across θ-levels between sample
sizes. The values of UB and LB were both nearly equivalent across θ-levels. Figures 7 and 8 also showed that
the UB and LB of $T^4$ were more stable than those of $T^1$. These figures also showed that, when $\hat{\theta}$ was used in $T^4$, the
UB was slightly lower than when $\hat{\theta}_{i-1}$ was used in $T^1$, whereas LB was approximately the same.

FIGURE 7. Upper and lower bound of CUSUM for $T^1$, where α = 0.10 (two-sided)

FIGURE 8. Upper and lower bound of CUSUM for $T^4$, where α = 0.10 (two-sided)

In Figures 9 and 10 the numerical values of the upper and lower bound, UB and LB, of the
CUSUM-procedure using statistics $T^2$ and $T^5$, respectively, were plotted against θ. Figure 9 showed that the
values of UB were almost equivalent across θ-levels for both sample sizes, and that the values of LB were
slightly different across θ-levels and across sample sizes; again, for θ = 1 and θ = 2 there were differences
across sample sizes. Figure 10 showed that when using $\hat{\theta}$ in $T^5$ instead of $\hat{\theta}_{i-1}$ in $T^2$, the differences across
sample size became smaller.

FIGURE 9. Upper and lower bound of CUSUM for $T^2$, where α = 0.10 (two-sided)

FIGURE 10. Upper and lower bound of CUSUM for $T^5$, where α = 0.10 (two-sided)

Finally, in Figures 11 and 12 the numerical values of the upper and lower bound, UB and LB, of the
CUSUM-procedure using statistics $T^3$ and $T^6$, respectively, were plotted against θ. Here the use of $\hat{\theta}$
($T^6$) resulted in a larger fluctuation in the numerical values of UB and LB than the use of $\hat{\theta}_{i-1}$ ($T^3$). Figure 11
showed that both UB and LB were nearly equivalent across different θ-levels and across sample size,
whereas the UB and LB of $T^6$ were quite different across θ-levels. Figure 11 also showed that the influence of
sample size on the distribution of UB was negligible. However, the distribution of LB was slightly different
across sample sizes.

FIGURE 11. Upper and lower bound of CUSUM for $T^3$, where α = 0.10 (two-sided)

FIGURE 12. Upper and lower bound of CUSUM for $T^6$, where α = 0.10 (two-sided)
(Legend: UB and LB for 200 and for 1,000 simulated test takers)

Discussion

In this study, we explored the usefulness of CUSUM-procedures in the context of person-fit analysis.
Several statistics were proposed and their behavior was investigated in a small simulation study (Study 1).

In Study 1 the effects of particular aberrant behavior were investigated. Results show that a
CUSUM-procedure is sensitive to nonfitting response vectors and that, to distinguish normal from aberrant
response vectors, it may be advisable to use $\hat{\theta}$ instead of $\hat{\theta}_{i-1}$ because the statistics using $\hat{\theta}_{i-1}$ are unstable in the
first part of the CAT. To classify a pattern as aberrant an upper and lower threshold should be determined.
Therefore, in Study 2 it was investigated whether the numerical values of the upper and lower bound were
dependent on θ. It was shown that, for all statistics, the upper and lower bound were rather stable across
θ-levels. Also, the use of $\hat{\theta}$ and $\hat{\theta}_{i-1}$ was examined. Results showed differences in the numerical values of the
upper and lower bound between statistics using $\hat{\theta}$ and $\hat{\theta}_{i-1}$.

In future research the detection rates for several different types of aberrant response behavior when
CUSUM-procedures are used to classify response patterns as nonfitting will be compared with those of other
person-fit statistics.